METHODS

Study Population and Data Source

Data were obtained from the Surveillance, Epidemiology, and End Results (SEER) Research Database for colon cancer patients (ICD-O-3 codes C18.0-C18.8) diagnosed 2010-2020. The study population consisted of patients with complete data on race/ethnicity, income category, age, sex, cancer stage, and survival status. Complete case analysis was used; patients with missing values on age (n=845) or income category (n=4) were excluded, resulting in an analytical sample of n=181,264.

Variables and Operationalization

Outcome Variable - Cancer-Specific Survival:
- Measured as months from cancer diagnosis to cancer-specific death or censoring at last follow-up
- Death classification: “Dead (attributable to this cancer dx)” coded as event=1, all others coded as event=0
- Time variable derived from SEER survival_months field

Primary Exposure 1 - Race/Ethnicity:
- Original SEER variable: race_and_origin_recode_nhw_nhb_nhaian_nhapi_hispanic
- Recoded into 6 categories:
  - “Non-Hispanic White” - NHW (reference)
  - “Non-Hispanic Black” - NHB
  - “Hispanic (All Races)” - Hispanic
  - “Non-Hispanic Asian or Pacific Islander” - API
  - “Non-Hispanic American Indian/Alaska Native” - AIAN
  - All other categories - Other
- Stored as an unordered factor with NHW as the reference level

Primary Exposure 2 - County-Level Income:
- Original SEER variable: median_household_income_inflation_adj_to_2021
- Recoded into 5 ordered categories based on income thresholds:
  - “< $35,000”, “$35,000 - $39,999”, “$40,000 - $44,999” - Low income (comparison group)
  - “$45,000 - $49,999”, “$50,000 - $54,999” - Lower-middle
  - “$55,000 - $59,999”, “$60,000 - $64,999” - Middle
  - “$65,000 - $69,999”, “$70,000 - $74,999” - Upper-middle
  - “$75,000+” - High income
- Stored as an ordered factor (Low income lowest). R applies orthogonal polynomial contrasts (linear, quadratic, cubic, quartic) to ordered factors in regression models, so the Cox coefficients for income are trend contrasts; Low income serves as the comparison group only for the RMST differences.

Covariates:
- Age: Recoded from age_recode_with_1_year_olds into numeric midpoint values (18.5 to 87 years)
- Sex: Recoded from original sex variable into binary factor (Male [reference], Female)
- Cancer Stage: Derived from combined_summary_stage_2004, coded as an ordered factor (Localized < Regional < Distant) and therefore modeled with polynomial (linear and quadratic) contrasts rather than a reference level
- Year of Diagnosis: Derived from year of diagnosis variable, entered as continuous numeric variable

Statistical Analysis

Primary Objective 1 - Association between Race/Ethnicity and Colon Cancer Survival: Kaplan-Meier curves with log-rank tests were generated to visualize and compare survival by racial/ethnic groups. Restricted Mean Survival Time (RMST) was calculated for each racial/ethnic group within a 60-month window to estimate differences in average survival.

Primary Objective 2 - Association between County-Level Income and Colon Cancer Survival: Kaplan-Meier curves with log-rank tests were generated to visualize and compare survival by income categories. RMST was calculated for each income category within the same 60-month window to quantify differences in survival.
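
For reference, the RMST over a window τ is the area under the survival curve up to τ. The notation below is the standard definition rather than something stated in the analysis code; the symbols t_j, J, and Ŝ are introduced here purely for illustration, with t_1 < ... < t_J denoting the distinct observed times before τ at which the Kaplan-Meier estimate Ŝ changes.

```latex
\mathrm{RMST}(\tau) = \int_0^{\tau} S(t)\,dt,
\qquad
\widehat{\mathrm{RMST}}(\tau) = \sum_{j=0}^{J} \hat{S}(t_j)\,\bigl(t_{j+1} - t_j\bigr),
\quad t_0 = 0,\ \hat{S}(t_0) = 1,\ t_{J+1} = \tau
```

In this study τ = 60 months, so each group's RMST can be read as the average number of months, out of the first 60 after diagnosis, that a patient is expected to survive without dying of colon cancer.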

Secondary Objective - Effect Modification: A Cox proportional hazards model with race × income interaction terms was fit and compared with the main effects model by likelihood ratio test to assess whether county-level income modifies the race/ethnicity–mortality association. The proportional hazards (PH) assumption was tested using Schoenfeld residuals (cox.zph) for the main effects Cox model.

Preliminary Analyses: Descriptive statistics summarized the study population by race/ethnicity, income category, and cancer stage. A main effects Cox model (without interactions) was fit to estimate adjusted associations, controlling for age, sex, stage, and year of diagnosis.

Software: R version 4.4.2 with packages survival, survminer, tidyverse (including ggplot2 and dplyr), janitor, gridExtra, and kableExtra.


DATA LOADING AND PREPARATION

library(survival)
library(survminer)
library(tidyverse)
library(dplyr)
library(janitor)
library(ggplot2)
library(gridExtra)
library(kableExtra)

theme_set(theme_minimal() + 
          theme(panel.grid.major = element_blank(),
                panel.border = element_rect(color = "black", fill = NA),
                plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
                axis.title = element_text(face = "bold", size = 11)))
seer <- read.csv("/Users/GAGANV/Desktop/ME/seerfinal_export.csv")
seer <- seer %>% clean_names()

cat("Dataset dimensions:", nrow(seer), "rows x", ncol(seer), "columns\n")

Dataset dimensions: 182113 rows x 15 columns

Race/Ethnicity Recoding

seer <- seer %>%
  mutate(
    race_ethnicity = case_when(
      race_and_origin_recode_nhw_nhb_nhaian_nhapi_hispanic == "Non-Hispanic White" ~ "NHW",
      race_and_origin_recode_nhw_nhb_nhaian_nhapi_hispanic == "Non-Hispanic Black" ~ "NHB",
      race_and_origin_recode_nhw_nhb_nhaian_nhapi_hispanic == "Hispanic (All Races)" ~ "Hispanic",
      race_and_origin_recode_nhw_nhb_nhaian_nhapi_hispanic == "Non-Hispanic Asian or Pacific Islander" ~ "API",
      race_and_origin_recode_nhw_nhb_nhaian_nhapi_hispanic == "Non-Hispanic American Indian/Alaska Native" ~ "AIAN",
      TRUE ~ "Other"
    ),
    race_ethnicity = factor(race_ethnicity, 
                            levels = c("NHW", "NHB", "Hispanic", "API", "AIAN", "Other"))
  )
Race/Ethnicity N Percent
NHW 116272 63.85
NHB 22769 12.50
Hispanic 24270 13.33
API 16172 8.88
AIAN 1501 0.82
Other 1129 0.62

Income Category Recoding

seer <- seer %>%
  mutate(
    income_cat = case_when(
      median_household_income_inflation_adj_to_2021 %in% c("< $35,000", "$35,000 - $39,999", "$40,000 - $44,999") ~ "Low income",
      median_household_income_inflation_adj_to_2021 %in% c("$45,000 - $49,999", "$50,000 - $54,999") ~ "Lower-middle",
      median_household_income_inflation_adj_to_2021 %in% c("$55,000 - $59,999", "$60,000 - $64,999") ~ "Middle",
      median_household_income_inflation_adj_to_2021 %in% c("$65,000 - $69,999", "$70,000 - $74,999") ~ "Upper-middle",
      median_household_income_inflation_adj_to_2021 == "$75,000+" ~ "High income",
      TRUE ~ NA_character_
    ),
    income_cat = factor(income_cat, 
                       levels = c("Low income", "Lower-middle", "Middle", "Upper-middle", "High income"),
                       ordered = TRUE)
  )
Income Category N Percent
Low income 13515 7.42
Lower-middle 19898 10.93
Middle 33579 18.44
Upper-middle 44119 24.23
High income 70998 38.99
NA 4 0.00

Stage, Age, and Survival Variables

seer <- seer %>%
  mutate(
    stage = factor(combined_summary_stage_2004, 
                   levels = c("Localized", "Regional", "Distant"),
                   ordered = TRUE),
    age_numeric = case_when(
      age_recode_with_1_year_olds == "18-19 years" ~ 18.5,
      age_recode_with_1_year_olds == "20-24 years" ~ 22,
      age_recode_with_1_year_olds == "25-29 years" ~ 27,
      age_recode_with_1_year_olds == "30-34 years" ~ 32,
      age_recode_with_1_year_olds == "35-39 years" ~ 37,
      age_recode_with_1_year_olds == "40-44 years" ~ 42,
      age_recode_with_1_year_olds == "45-49 years" ~ 47,
      age_recode_with_1_year_olds == "50-54 years" ~ 52,
      age_recode_with_1_year_olds == "55-59 years" ~ 57,
      age_recode_with_1_year_olds == "60-64 years" ~ 62,
      age_recode_with_1_year_olds == "65-69 years" ~ 67,
      age_recode_with_1_year_olds == "70-74 years" ~ 72,
      age_recode_with_1_year_olds == "75-79 years" ~ 77,
      age_recode_with_1_year_olds == "80-84 years" ~ 82,
      age_recode_with_1_year_olds == "85+ years" ~ 87,
      TRUE ~ NA_real_
    ),
    time = survival_months,
    event = ifelse(seer_cause_specific_death_classification == "Dead (attributable to this cancer dx)", 1, 0),
    sex_binary = factor(sex, levels = c("Male", "Female"))
  )

cat("Survival Variables:\n")

Survival Variables:

cat("  Median follow-up:", median(seer$time, na.rm = TRUE), "months\n")

Median follow-up: 30 months

cat("  Events:", sum(seer$event, na.rm = TRUE), "deaths\n")

Events: 54688 deaths

cat("  Event rate:", round(100*mean(seer$event, na.rm = TRUE), 2), "%\n")

Event rate: 30.03 %


DESCRIPTIVE STATISTICS

By Race/Ethnicity

Race/Ethnicity N % Deaths Event Rate % Median FU (mo) Mean Age
NHW 116272 63.8 35113 30.2 31 66.9
NHB 22769 12.5 7969 35.0 27 62.6
Hispanic 24270 13.3 6748 27.8 28 61.5
API 16172 8.9 4382 27.1 32 65.0
AIAN 1501 0.8 422 28.1 29 62.6
Other 1129 0.6 54 4.8 39 60.7

By Income Category

Income Category N % Deaths Event Rate % Median FU (mo) Mean Age
Low income 13515 7.4 4607 34.1 32.0 65.0
Lower-middle 19898 10.9 6467 32.5 31.0 65.8
Middle 33579 18.4 11018 32.8 35.0 65.6
Upper-middle 44119 24.2 13916 31.5 34.0 65.5
High income 70998 39.0 18679 26.3 25.0 65.3
NA 4 0.0 1 25.0 15.5 75.8

By Cancer Stage

Stage N % Deaths Event Rate % Median FU (mo) Mean Age
Localized 67376 37 4919 7.3 48 65.6
Regional 67345 37 15523 23.0 36 65.8
Distant 47392 26 34246 72.3 10 64.6

KAPLAN-MEIER CURVES

By Race/Ethnicity

km_race <- survfit(Surv(time, event) ~ race_ethnicity, data = seer)
logrank_race <- survdiff(Surv(time, event) ~ race_ethnicity, data = seer)

## 
## Log-rank Test (Race/Ethnicity) χ² = 651.68 , p < 0.0001

By Income Category

km_income <- survfit(Surv(time, event) ~ income_cat, data = seer)
logrank_income <- survdiff(Surv(time, event) ~ income_cat, data = seer)

## 
## Log-rank Test (Income) χ² = 236.67 , p < 0.0001

By Cancer Stage

km_stage <- survfit(Surv(time, event) ~ stage, data = seer)
logrank_stage <- survdiff(Surv(time, event) ~ stage, data = seer)

## 
## Log-rank Test (Stage) χ² = 89810.31 , p < 0.0001
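
The code that formats the log-rank lines above is not shown. A minimal sketch of how the chi-square statistic and p-value could be pulled from a survdiff object follows. It uses the lung dataset bundled with the survival package as a stand-in for the SEER extract, and the helper name logrank_summary is hypothetical, not part of the original analysis.

```r
library(survival)

# Hypothetical helper: extract the log-rank chi-square, degrees of freedom,
# and p-value from a survdiff object fitted with one grouping variable.
logrank_summary <- function(fit) {
  df <- length(fit$n) - 1                      # number of groups minus one
  p  <- pchisq(fit$chisq, df = df, lower.tail = FALSE)
  list(chisq = unname(fit$chisq), df = df, p = p)
}

# Stand-in data (assumption): survival::lung, where status == 2 marks death
lr_demo <- survdiff(Surv(time, status == 2) ~ sex, data = lung)
res <- logrank_summary(lr_demo)

cat(sprintf("Log-rank chi-square = %.2f on %d df, p = %.4g\n",
            res$chisq, res$df, res$p))
```

Applied to the objects in this report, the same helper would be called as logrank_summary(logrank_race), logrank_summary(logrank_income), or logrank_summary(logrank_stage).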

COX PROPORTIONAL HAZARDS MODELS

Main Effects Model

cox_model1 <- coxph(
  Surv(time, event) ~ race_ethnicity + income_cat + age_numeric + sex_binary + 
                       stage + year_of_diagnosis,
  data = seer
)

Model Formula: Surv(time, event) ~ race_ethnicity + income_cat + age_numeric + sex_binary + stage + year_of_diagnosis

Sample Size: N = 181264 , Events = 54655

Variable                 Coefficient  HR      95% CI Lower  95% CI Upper  P-value
race_ethnicityNHB        0.1659       1.1805  1.1516        1.2100        1.5e-39
race_ethnicityHispanic   0.0272       1.0275  1.0008        1.0550        4.4e-02
race_ethnicityAPI        -0.0946      0.9098  0.8812        0.9392        6.0e-09
race_ethnicityAIAN       0.0631       1.0652  0.9675        1.1727        2.0e-01
race_ethnicityOther      -1.0187      0.3611  0.2764        0.4716        7.7e-14
income_cat.L             -0.1196      0.8873  0.8671        0.9080        2.5e-24
income_cat.Q             -0.0334      0.9672  0.9465        0.9883        2.4e-03
income_cat.C             -0.0313      0.9692  0.9488        0.9901        4.0e-03
income_cat^4             0.0177       1.0179  0.9979        1.0383        8.0e-02
age_numeric              0.0329       1.0335  1.0328        1.0342        0.0e+00
sex_binaryFemale         -0.0639      0.9381  0.9224        0.9540        1.1e-13
stage.L                  2.2589       9.5727  9.3680        9.7819        0.0e+00
stage.Q                  0.2614       1.2987  1.2758        1.3220        9.8e-182
year_of_diagnosis        -0.0128      0.9873  0.9843        0.9903        1.5e-16
Model Performance:
Concordance = 0.824
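
The rows labelled .L, .Q, .C, and ^4 arise because income_cat and stage are ordered factors, which R codes with orthogonal polynomial contrasts (contr.poly) by default. Their hazard ratios are therefore trend contrasts, not level-versus-reference comparisons. The sketch below shows one way pairwise hazard ratios could be recovered from the reported coefficients; the helper name pairwise_hr is hypothetical, the coefficients are copied from the table above, and the approach assumes the default contr.poly coding was in effect when the model was fit.

```r
# Hypothetical helper: convert polynomial-contrast coefficients for an ordered
# factor into hazard ratios of each level versus the lowest level.
pairwise_hr <- function(beta, level_names) {
  P  <- contr.poly(length(level_names))   # default coding for ordered factors
  lp <- as.vector(P %*% beta)             # linear predictor at each level
  names(lp) <- level_names
  exp(lp - lp[1])                         # HR relative to the first level
}

# Coefficients copied from the main effects table (stage.L, stage.Q)
hr_stage <- pairwise_hr(c(2.2589, 0.2614),
                        c("Localized", "Regional", "Distant"))

# Coefficients copied from the main effects table (income_cat.L, .Q, .C, ^4)
hr_income <- pairwise_hr(c(-0.1196, -0.0334, -0.0313, 0.0177),
                         c("Low income", "Lower-middle", "Middle",
                           "Upper-middle", "High income"))

print(round(hr_stage, 2))
print(round(hr_income, 3))
```

Under these assumptions the implied hazard ratios relative to Localized disease are roughly 3.6 for Regional and 24 for Distant, and the implied High-versus-Low income hazard ratio is roughly 0.84. Confidence intervals for these pairwise contrasts would additionally require the coefficient covariance matrix (vcov(cox_model1)); declaring the factors as unordered before fitting would yield reference-level hazard ratios directly.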

Proportional Hazards Assumption Test

ph_test <- cox.zph(cox_model1)
Variable            χ²       df  p-value
race_ethnicity      244.24   5   9.5e-51
income_cat          10.35    4   3.5e-02
age_numeric         2075.66  1   0.0e+00
sex_binary          64.70    1   8.7e-16
stage               160.57   2   1.4e-35
year_of_diagnosis   9.18     1   2.4e-03
GLOBAL              2363.36  14  0.0e+00
plot(ph_test, main = "Schoenfeld Residuals: Proportional Hazards Assessment")


PRIMARY ANALYSIS: RESTRICTED MEAN SURVIVAL TIME

seer_complete <- seer %>%
  filter(!is.na(age_numeric), !is.na(income_cat), !is.na(stage), !is.na(time), !is.na(event)) %>%
  select(time, event, race_ethnicity, income_cat, age_numeric, sex_binary, stage, year_of_diagnosis)

cat("Analysis sample (complete cases):", nrow(seer_complete), "patients\n\n")

Analysis sample (complete cases): 181264 patients

tau <- 60
rmst_race_results <- list()

for (race in levels(seer_complete$race_ethnicity)) {
  seer_race <- seer_complete %>% filter(race_ethnicity == race)
  km_fit <- survfit(Surv(time, event) ~ 1, data = seer_race)
  
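  # Approximate the area under the KM curve on [0, tau] with the trapezoidal rule.
  # Note: the KM estimate is a step function and the grid stops at the last
  # observed time <= tau, so this slightly underestimates the exact RMST.
  times <- c(0, km_fit$time[km_fit$time <= tau])
  surv <- c(1, km_fit$surv[km_fit$time <= tau])
  rmst <- sum(diff(times) * (surv[-length(surv)] + surv[-1]) / 2)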
  
  rmst_race_results[[race]] <- list(
    RMST = rmst,
    N = nrow(seer_race),
    Events = sum(seer_race$event)
  )
}

rmst_table <- data.frame(
  Race_Ethnicity = names(rmst_race_results),
  N = sapply(rmst_race_results, "[[", "N"),
  Deaths = sapply(rmst_race_results, "[[", "Events"),
  RMST_months = round(sapply(rmst_race_results, "[[", "RMST"), 2)
)

nhw_rmst <- rmst_table$RMST_months[rmst_table$Race_Ethnicity == "NHW"]
rmst_table$Diff_vs_NHW <- round(rmst_table$RMST_months - nhw_rmst, 2)

rmst_display <- rmst_table
names(rmst_display) <- c("Race/Ethnicity", "N", "Deaths", "RMST (months)", "Difference vs NHW")

RMST by Race/Ethnicity

Race/Ethnicity  N       Deaths  RMST (months)  Difference vs NHW
NHW             115751  35100   44.79          0.00
NHB             22727   7963    42.48          -2.31
Hispanic        24037   6739    45.71          0.92
API             16139   4378    46.69          1.90
AIAN            1496    421     45.22          0.43
Other           1114    54      57.43          12.64
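
The loop above integrates the Kaplan-Meier curve with the trapezoidal rule. Because the Kaplan-Meier estimate is a right-continuous step function, the exact area is a sum of rectangles, with the last rectangle extended out to tau. The sketch below contrasts the two on a toy curve; the helper rmst_step and the toy values are hypothetical and are not SEER results.

```r
# Hypothetical helper: exact area under a right-continuous step function S(t)
# on [0, tau], given jump times and the survival value after each jump.
rmst_step <- function(time, surv, tau) {
  keep <- time <= tau
  grid <- c(0, time[keep], tau)   # extend the final interval out to tau
  s    <- c(1, surv[keep])        # S(t) holds its left-hand value on each interval
  sum(diff(grid) * s)
}

# Toy curve (not SEER data): survival drops to 0.8, 0.5, 0.2 at months 1, 2, 3
toy_time <- c(1, 2, 3)
toy_surv <- c(0.8, 0.5, 0.2)

exact <- rmst_step(toy_time, toy_surv, tau = 4)

# Trapezoidal rule on the same points, stopping at the last time <= tau
tt <- c(0, toy_time)
ss <- c(1, toy_surv)
trapezoid <- sum(diff(tt) * (ss[-length(ss)] + ss[-1]) / 2)

cat("Step-function RMST:", exact, " Trapezoid:", trapezoid, "\n")
```

With a survfit object the helper would be called as rmst_step(km_fit$time, km_fit$surv, tau = 60). With monthly follow-up times the gap between the two methods is expected to be small, and between-group differences should be affected less than the group-level values themselves.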

RMST by Income Category

rmst_income_results <- list()

for (inc in levels(seer_complete$income_cat)) {
  seer_inc <- seer_complete %>% filter(income_cat == inc)
  km_fit <- survfit(Surv(time, event) ~ 1, data = seer_inc)
  
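  # Same trapezoidal approximation as for race/ethnicity: the KM curve is a step
  # function and the grid ends at the last observed time <= tau, so the result
  # slightly underestimates the exact RMST.
  times <- c(0, km_fit$time[km_fit$time <= tau])
  surv <- c(1, km_fit$surv[km_fit$time <= tau])
  rmst <- sum(diff(times) * (surv[-length(surv)] + surv[-1]) / 2)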
  
  rmst_income_results[[inc]] <- list(
    RMST = rmst,
    N = nrow(seer_inc),
    Events = sum(seer_inc$event)
  )
}

rmst_income_table <- data.frame(
  Income_Category = names(rmst_income_results),
  N = sapply(rmst_income_results, "[[", "N"),
  Deaths = sapply(rmst_income_results, "[[", "Events"),
  RMST_months = round(sapply(rmst_income_results, "[[", "RMST"), 2)
)

low_rmst <- rmst_income_table$RMST_months[rmst_income_table$Income_Category == "Low income"]
rmst_income_table$Diff_vs_Low <- round(rmst_income_table$RMST_months - low_rmst, 2)

rmst_income_display <- rmst_income_table
names(rmst_income_display) <- c("Income Category", "N", "Deaths", "RMST (months)", "Difference vs Low")
Income Category  N      Deaths  RMST (months)  Difference vs Low
Low income       13479  4604    43.31          0.00
Lower-middle     19843  6464    43.92          0.61
Middle           33464  11012   44.27          0.96
Upper-middle     43911  13904   44.71          1.40
High income      70567  18671   45.88          2.57

SECONDARY OBJECTIVE: EFFECT MODIFICATION BY INCOME

Interaction Model: Race × Income

cox_model2 <- coxph(
  Surv(time, event) ~ race_ethnicity * income_cat + age_numeric + sex_binary + 
                       stage + year_of_diagnosis,
  data = seer
)
Model Parameters χ² P-value
Main Effects 14 NA NA
Race × Income 34 32.94 0.0343
Note:
LRT χ² = 32.94 on 20 df; p = 0.0343
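
The likelihood ratio test compares the interaction model (34 parameters) with the nested main effects model (14 parameters), so the statistic is referred to a chi-square distribution with 34 - 14 = 20 degrees of freedom. A minimal check of the reported p-value, using only the figures from the table above (the variable names are illustrative):

```r
# Figures copied from the model comparison table above
n_par_main <- 14      # parameters in the main effects model
n_par_int  <- 34      # parameters in the race x income interaction model
lrt_chisq  <- 32.94   # reported likelihood ratio chi-square

# Degrees of freedom equal the number of added interaction terms
lrt_df <- n_par_int - n_par_main

# Upper-tail chi-square probability
lrt_p <- pchisq(lrt_chisq, df = lrt_df, lower.tail = FALSE)

cat(sprintf("LRT chi-square = %.2f on %d df, p = %.4f\n",
            lrt_chisq, lrt_df, lrt_p))
```

In practice the same comparison is usually obtained with anova(cox_model1, cox_model2), which is valid as long as both models are fit to the same set of complete cases.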

Interaction Coefficients

Interaction Term                      Coefficient  HR      95% CI Lower  95% CI Upper  P-value
race_ethnicityNHB:income_cat.L        0.0481       1.0493  0.9901        1.1120        1.0e-01
race_ethnicityHispanic:income_cat.L   -0.0237      0.9766  0.8865        1.0758        6.3e-01
race_ethnicityAPI:income_cat.L        -0.0633      0.9386  0.6687        1.3176        7.1e-01
race_ethnicityAIAN:income_cat.L       -0.2948      0.7447  0.5994        0.9252        7.8e-03
race_ethnicityOther:income_cat.L      -0.0857      0.9179  0.3016        2.7930        8.8e-01
race_ethnicityNHB:income_cat.Q        -0.0251      0.9752  0.9219        1.0315        3.8e-01
race_ethnicityHispanic:income_cat.Q   0.0118       1.0119  0.9277        1.1037        7.9e-01
race_ethnicityAPI:income_cat.Q        -0.0092      0.9908  0.7412        1.3246        9.5e-01
race_ethnicityAIAN:income_cat.Q       0.2502       1.2843  1.0235        1.6116        3.1e-02
race_ethnicityOther:income_cat.Q      0.7804       2.1822  0.7865        6.0549        1.3e-01
race_ethnicityNHB:income_cat.C        -0.0538      0.9476  0.8944        1.0040        6.8e-02
race_ethnicityHispanic:income_cat.C   -0.0663      0.9359  0.8637        1.0140        1.1e-01
race_ethnicityAPI:income_cat.C        0.0589       1.0607  0.8604        1.3076        5.8e-01
race_ethnicityAIAN:income_cat.C       0.0387       1.0394  0.8136        1.3280        7.6e-01
race_ethnicityOther:income_cat.C      -1.0474      0.3509  0.0911        1.3518        1.3e-01
race_ethnicityNHB:income_cat^4        0.0140       1.0141  0.9603        1.0709        6.1e-01
race_ethnicityHispanic:income_cat^4   0.0899       1.0941  1.0245        1.1685        7.4e-03
race_ethnicityAPI:income_cat^4        -0.0996      0.9052  0.7947        1.0311        1.3e-01
race_ethnicityAIAN:income_cat^4       0.0210       1.0213  0.7949        1.3120        8.7e-01
race_ethnicityOther:income_cat^4      0.2389       1.2699  0.4184        3.8540        6.7e-01
Note:
Interaction terms represent how the effect of each race/ethnicity group (relative to NHW) varies along the linear (.L), quadratic (.Q), cubic (.C), and quartic (^4) polynomial contrasts of the ordered income factor; they are not comparisons between specific income categories.

RESULTS

Among 181,264 colon cancer patients from SEER (2010-2020) in the analytical sample, there were 54,655 cancer-specific deaths; median follow-up in the full cohort of 182,113 patients was 30 months. Significant disparities in survival were observed by race/ethnicity and county-level income.

Primary Objective 1 - Race/Ethnicity Association: Non-Hispanic Black (NHB) patients had significantly worse survival than Non-Hispanic White (NHW) patients (adjusted HR = 1.18, 95% CI: 1.15-1.21, p < 0.0001 in the main effects Cox model). Because the proportional hazards assumption was violated (global test χ² = 2363.36, p < 0.0001), RMST over 60 months was used as the primary measure: NHB patients had an unadjusted RMST of 42.48 months compared with 44.79 months for NHW patients, a deficit of 2.31 months. Asian/Pacific Islander patients had more favorable outcomes (HR = 0.91; RMST = 46.69 months, +1.90 months versus NHW). Hispanic patients showed a borderline excess hazard in the adjusted model (HR = 1.03, 95% CI: 1.00-1.06, p = 0.044) despite a slightly longer unadjusted RMST of 45.71 months (+0.92 months versus NHW).

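Primary Objective 2 - Income Association: A socioeconomic gradient in survival was evident, with RMST increasing across each successive income category. High-income patients had an RMST of 45.88 months compared with 43.31 months for low-income patients, a difference of 2.57 months (about 6% longer restricted mean survival). In the Cox model, the linear trend contrast across income categories was significant (HR = 0.89, p < 0.0001), indicating lower hazard of death with higher county-level income. Because income was entered as an ordered factor with orthogonal polynomial contrasts, this coefficient describes the overall linear trend and should not be read as an 11% hazard reduction per income category.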

Secondary Objective - Effect Modification: Adding race × income interaction terms significantly improved model fit (LRT: χ² = 32.94 on 20 df, p = 0.0343), indicating that income modifies the race/ethnicity–mortality association. However, the improvement was modest (20 additional parameters yielded a χ² gain of 32.94), and only 3 of the 20 interaction coefficients had p < 0.05, suggesting that race and income effects are largely independent rather than synergistic. Cancer stage was the strongest predictor overall (linear trend contrast HR = 9.57, p < 0.0001); because stage was coded as an ordered factor, this is a polynomial trend contrast across Localized, Regional, and Distant disease rather than a pairwise distant-versus-localized hazard ratio. The Cox main effects model showed good discrimination (Concordance = 0.824).


DISCUSSION

Key Findings

This analysis of SEER data documents significant and clinically meaningful disparities in colon cancer-specific survival by both race/ethnicity and county-level income. Non-Hispanic Black patients experienced an 18% excess adjusted hazard of cancer-specific death and an unadjusted restricted mean survival time 2.3 months shorter over the first 60 months than Non-Hispanic White patients. Income-based disparities were consistent: high-income patients had 2.57 more months of restricted mean survival (6%) than low-income patients, and RMST rose with each successive income category, consistent with a dose-response relationship.

The secondary analysis indicated that county-level income statistically modifies the race/ethnicity–mortality association, though the magnitude of this effect modification is modest and most interaction terms are not statistically significant. This suggests that race and income operate as largely independent predictors of survival.

Interpretation

The racial disparity in survival is substantial and persists after adjustment for age, sex, cancer stage, and year of diagnosis. The income-based gradient suggests that socioeconomic factors, whether directly or as proxies for healthcare access, quality, comorbidities, or health literacy, substantially affect cancer outcomes. The modest interaction suggests that targeted interventions should address both racial/ethnic minorities and low-income patients, but may not need to focus specifically on race-by-income combinations.

Implications

Clinically, oncology teams should recognize that both NHB patients and low-income patients face substantial survival disadvantages and may benefit from enhanced surveillance and supportive care protocols. Health policy should address both racial/ethnic and socioeconomic barriers to equitable cancer care.


LIMITATIONS

This study, while providing valuable insights into health disparities in colon cancer survival, has several important limitations:

  1. Proportional Hazards Assumption Violation: The Cox proportional hazards assumption was violated globally (χ² = 2363.36, p < 0.0001), necessitating RMST as the primary analysis. Results should not be extrapolated beyond 60 months.

  2. County-Level vs. Individual-Level Socioeconomic Status: Income was assessed at the county median household level rather than individual patient level, subject to ecological fallacy.

  3. Lack of Treatment Data: SEER does not consistently record chemotherapy, radiation, or surgical details, limiting mechanistic understanding.

  4. Missing Comorbidity Information: Comorbid conditions are not captured in SEER.

  5. Selection Bias from Complete Case Analysis: Complete case analysis assumes data missing completely at random (MCAR).

  6. Regional Data Limitations: SEER represents approximately 35% of the US population and may not be nationally representative.

  7. Cancer-Specific vs. All-Cause Mortality: This analysis examined cancer-specific death only; deaths from other causes were treated as censored, and competing risks were not explicitly modeled.

  8. “Other” Race Category: The “Other” category included only 1,129 patients with 54 events, resulting in imprecise estimates.

  9. Temporal Generalization: These findings cover 2010-2020 and may not reflect current disparities.

  10. Unmeasured Confounding: Unmeasured factors such as provider implicit bias, patient trust, language barriers, insurance stability, and tumor biology may explain observed disparities.

  11. Missing Mechanism Data: We cannot determine whether disparities result from differences in screening, treatment intensity, adherence, or surveillance.

  12. Modest Effect Modification: While the interaction term is statistically significant, the modest effect size and non-significant individual interaction coefficients suggest limited practical effect modification.


CONCLUSION

This analysis of SEER data documents significant and clinically meaningful disparities in colon cancer-specific survival by race/ethnicity and county-level income. Non-Hispanic Black patients experienced an 18% excess adjusted hazard of cancer-specific death and a 2.3-month deficit in restricted mean survival over 5 years. Income showed a consistent dose-response relationship, with high-income patients having 2.57 more months of restricted mean survival than low-income patients. County-level income statistically modifies the race/ethnicity–mortality association, though the effect is modest and most interaction terms are not significant, suggesting largely independent effects. These findings underscore the need for targeted interventions addressing both racial/ethnic and socioeconomic disparities in cancer outcomes. Future research incorporating treatment data, individual-level SES, and detailed mechanistic analyses is needed to identify modifiable drivers of these disparities.


SESSION INFORMATION

sessionInfo()

R version 4.4.2 (2024-10-31) Platform: aarch64-apple-darwin20 Running under: macOS 26.1

Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] kableExtra_1.4.0 gridExtra_2.3 janitor_2.2.1 lubridate_1.9.3 [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0 [13] survminer_0.5.1 ggpubr_0.6.1 ggplot2_4.0.0 survival_3.7-0

loaded via a namespace (and not attached): [1] gtable_0.3.6 xfun_0.51 bslib_0.8.0 rstatix_0.7.2
[5] lattice_0.22-6 tzdb_0.4.0 vctrs_0.6.5 tools_4.4.2
[9] generics_0.1.3 fansi_1.0.6 pkgconfig_2.0.3 Matrix_1.7-1
[13] data.table_1.17.8 RColorBrewer_1.1-3 S7_0.2.0 lifecycle_1.0.4
[17] compiler_4.4.2 farver_2.1.2 textshaping_0.4.0 snakecase_0.11.1
[21] carData_3.0-5 litedown_0.6 htmltools_0.5.8.1 sass_0.4.9
[25] yaml_2.3.10 Formula_1.2-5 pillar_1.9.0 car_3.1-3
[29] jquerylib_0.1.4 cachem_1.1.0 abind_1.4-8 km.ci_0.5-6
[33] commonmark_1.9.5 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
[37] labeling_0.4.3 splines_4.4.2 fastmap_1.2.0 grid_4.4.2
[41] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4 broom_1.0.8
[45] withr_3.0.2 scales_1.4.0 backports_1.5.0 timechange_0.3.0
[49] rmarkdown_2.29 ggtext_0.1.2 ggsignif_0.6.4 zoo_1.8-13
[53] hms_1.1.3 evaluate_1.0.1 knitr_1.49 KMsurv_0.1-6
[57] viridisLite_0.4.2 markdown_2.0 survMisc_0.5.6 rlang_1.1.6
[61] Rcpp_1.1.0 gridtext_0.1.5 xtable_1.8-4 glue_1.8.0
[65] xml2_1.3.6 svglite_2.2.1 rstudioapi_0.17.1 jsonlite_1.8.9
[69] R6_2.5.1 systemfonts_1.3.1


Analysis Date: December 03, 2025
Study: SEER Colon Cancer Survival Analysis
Author: Gagan Vijay
Supervisory Faculty: Dr. Kim, Washington University in St. Louis